It is shown that the 21264 Alpha processor can reach about 20% sustained efficiency for the inversion of the
Wilson-Dirac operator. Since fast Ethernet is not sufficient to balance computation and communication
on reasonable lattice and system sizes, an interconnection using Myrinet is discussed. We find a
price/performance ratio comparable with state-of-the-art SIMD systems for lattice QCD.



1. Introduction 

The urgent need for cheap sustained compute
power for lattice QCD (LQCD) provides a strong
motivation to fathom the potential of PC or
workstation clusters. Only recently have PCs and
workstations become both fast and cheap enough
to make their clustering in commodity networks
economical in terms of local performance,
scalability and total system size. Moreover,
operating clusters efficiently requires open
source operating systems such as Linux. The
apparent success of Beowulf clusters and the
tremendous peak compute power of Alpha
processors, as realized in the Avalon cluster [1],
have immediately caught the attention of the
lattice community.

We are going to investigate two different cluster
approaches, both based on Compaq Alpha
processors:

One system (NICSE-TS) we have designed and
benchmarked, using state-of-the-art iterative
solver codes, is a four-node cluster of 533 MHz
21164 EV56 Alpha processors, installed as a test
system at the John von Neumann-Institut für
Computing in Jülich, Germany, and operated
under Linux. Since QCD involves only
nearest-neighbor interactions, a mesh-based
connectivity appeared to be the natural parallel
architecture to handle the ensuing interprocessor
communication between the nodes.

Our second test cluster (ALiCE-TS) has been
designed in light of the experience gained with
NICSE-TS. Besides the shift to 21264 EV6 Alpha
processors, we are using Myrinet, a Gbit network.
This promises interprocessor connectivity fast
enough for LQCD computations on Alpha clusters.
As Myrinet provides a multi-stage crossbar, we
have given up the former mesh approach. This test
system again consists of four workstations. We
will show that ALiCE-TS is superior to the
"cheap" NICSE-TS in terms of price/performance
ratio by nearly a factor of two.

The paper is organized as follows: in Section 2
we give the specifications of the two cluster
variants tested. Section 3 describes our benchmark
codes and contains some results, and in Section 4
we give price/performance ratios.

2. The Testbeds 

The benchmark systems each consist of four
single-processor nodes, with two different
generations of Alpha processors. The connectivity
is fast Ethernet and Myrinet, respectively.

2.1. NICSE-TS 

NICSE-TS is a four-node system with fast
Ethernet connectivity. The system is located at
NIC, FZ-Jülich. The nodes are very similar to the
Avalon nodes, i.e. they contain:

• 533 MHz 21164A Alpha microprocessors,
2 MB 3rd level cache, Samsung Alpha-PC
164UX motherboards

• ECC SDRAM DIMMs (256 MB per node)

• D-Link DFE 500 TX Ethernet cards

• MPI based on MPICH

*Talk presented by N. Eicker.

The main difference from Avalon is the network
setup. Where Avalon has an all-to-all network
using switches, NICSE-TS uses a 2-D torus.
Thus we need four Ethernet cards per node where
Avalon employs only one. On the other hand,
we do not need any switch. We expect the network
performance to scale to a large number of nodes
for nearest-neighbor communication. All-to-all
communication can be achieved through the
routing capabilities of the Linux kernel.
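
The wiring implied by a 2-D torus can be sketched as follows. This is only an illustration: the row-major rank numbering, the axis labels and the function name are assumptions made here and are not taken from the NICSE-TS setup.

```python
def torus_neighbors(rank, nx, ny):
    """Return the four nearest-neighbor ranks of `rank` on an nx x ny torus.

    Assumes a hypothetical row-major numbering: rank = x + y * nx.
    """
    x, y = rank % nx, rank // nx
    return {
        "x+": (x + 1) % nx + y * nx,
        "x-": (x - 1) % nx + y * nx,
        "y+": x + ((y + 1) % ny) * nx,
        "y-": x + ((y - 1) % ny) * nx,
    }


if __name__ == "__main__":
    # One link per direction, i.e. four network cards per node.
    print(torus_neighbors(5, 4, 4))
    # On a 2 x 2 torus the "+" and "-" neighbors coincide.
    print(torus_neighbors(0, 2, 2))
```

Each node needs exactly one link per direction regardless of the torus size, which is why no switch is required for nearest-neighbor traffic.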

2.2. ALiCE-TS 

ALiCE-TS is a four-node cluster with switched 
Myrinet connectivity. This system is hosted at 
Wuppertal University. It includes: 

• 466 MHz 21264 Alpha microprocessors,
2 MB 2nd level cache, Compaq DS10
motherboards

• ECC SDRAM DIMMs (128 MB per node)

• 64-bit 33 MHz Myrinet-SAN/PCI interface

• MPI based on Myrinet GM library

ALiCE-TS has been purchased as a prototype
system for the design of the 128-node Wuppertal
Alpha-Linux-Cluster Engine (ALiCE).

3. QCD Benchmarks and Results 

The key computational problem of LQCD is
the very frequently repeated inversion of the
Dirac matrix. It has been shown that such systems
are most efficiently solved by Krylov subspace
methods like BiCGStab. The state of the art is
the application of parallel local-lexicographic
preconditioning within BiCGStab.

The results of this paper's benchmarks are 
based on two codes: 

BiCGStab is a sparse matrix Krylov solver
with regular memory access, where computation
and communication proceed in an alternating
fashion. In this case, DMA capabilities of the
communication cards are not exploited.

SSOR is the same solver but with
local-lexicographic SSOR preconditioning. The
SSOR process leads to rather irregular memory
access and extensive integer computations. This
code is very sensitive to the memory-to-cache
bandwidth. Since communication overlaps with
computation, DMA can be exploited.

Both codes are written in C and compiled under 
the GNU egcs-1.1.2 C compiler. Timing was done 
with MPI_Wtime. For both codes there exist two 
versions: 

1. To test single-node performance, the code
runs without communication operations,
otherwise carrying out exactly the same
operations as the following parallel version.

2. On the 4-node test machines, the physical
system is laid out in a 2-D fashion;
consequently, communication is carried out along
two dimensions, namely the z- and t-directions.
Assuming $N_{\rm proc} = N_z \times N_t$ processors,
the global lattice is divided into $N_t$ slices in
the t-direction, where every slice consists of
$N_z$ slices in the z-direction.

In the sequel, we are going to employ a local
lattice of size $16^2 \times 4 \times 8$ on
$2 \times 2$ processors such that we emulate a
realistic $16^3 \times 32$ system on $4 \times 4$
processors.
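
The relation between global lattice, processor grid and local lattice can be checked with a short sketch. The function name and the (x, y, z, t) ordering of the extents are assumptions made for illustration; only the z- and t-extents are split, as described above.

```python
def local_lattice(global_dims, proc_grid):
    """Split the z and t extents of an (x, y, z, t) lattice over an Nz x Nt grid."""
    x, y, z, t = global_dims
    nz, nt = proc_grid
    if z % nz or t % nt:
        raise ValueError("lattice extents must be divisible by the processor grid")
    return (x, y, z // nz, t // nt)


if __name__ == "__main__":
    # Realistic target: 16^3 x 32 on 4 x 4 processors.
    print(local_lattice((16, 16, 16, 32), (4, 4)))  # (16, 16, 4, 8)
    # Test setup: the same local volume on 2 x 2 processors.
    print(local_lattice((16, 16, 8, 16), (2, 2)))   # (16, 16, 4, 8)
```

Both configurations give the same per-node volume, so the four-node runs see the same local workload as a node in the larger system would.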

3.1. Single-node results

The basic operation in the iterative solution of
the Dirac matrix is the product of an SU(3) matrix
with two color vectors. The average number of
flops per matrix-vector operation is
$N_{\rm flop} = 171$. The number of complex words
to fetch from memory in order to carry out this
operation is $N_{\rm cwords} = (9 + 2 \times 12)$,
leading to $N_{\rm bytes} = 528$ bytes for double
precision arithmetic. Therefore we expect the
maximal performance that can be reached on a
single node to be limited by

$$
\frac{B}{N_{\rm bytes}}\, N_{\rm flop} \approx
\begin{cases}
97~\mathrm{MFlops} & \text{(UX)} \\
420~\mathrm{MFlops} & \text{(DS10)}
\end{cases}
$$

in a steady state of computation and data flow,
given a maximal memory bandwidth $B$ of 300 and
1300 MB/sec, respectively. Note that our problem
size is chosen to be larger than the available
caches. The real performance will be smaller
due to BLAS-1 and BLAS-2 operations within
BiCGStab.
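
The bandwidth bound above is simple arithmetic and can be reproduced with the following sketch. The function names are illustrative, and it is assumed that 1 MB/sec means 10^6 bytes per second and that a complex word consists of two reals of 8 bytes each in double precision.

```python
N_FLOP = 171           # average flops per matrix-vector operation (from the text)
N_CWORDS = 9 + 2 * 12  # complex words fetched per operation (from the text)


def bytes_per_operation(bytes_per_real=8):
    """Memory traffic per operation; a complex word holds two reals."""
    return N_CWORDS * 2 * bytes_per_real


def bandwidth_bound_mflops(bandwidth_mb_per_s, bytes_per_real=8):
    """Upper limit in MFlops if memory bandwidth is the only bottleneck."""
    return bandwidth_mb_per_s / bytes_per_operation(bytes_per_real) * N_FLOP


if __name__ == "__main__":
    print(bytes_per_operation())                 # 528
    print(round(bandwidth_bound_mflops(300)))    # 97  (UX)
    print(round(bandwidth_bound_mflops(1300)))   # 421 (DS10), quoted as 420 in the text
```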





             double prec.       single prec.
Benchmark    UX      DS10       UX      DS10
BiCGStab     82      166        116     232
SSOR         57      115        90      182

Table 1

One processor benchmark. Numbers in MFlops.



Table 1 shows that, on the UX board, the
performance of BiCGStab comes close to the
limiting value given, while the DS10 performance
deviates by more than a factor of 2 from the
estimate. The local lattice size presumably is too
small to lead to saturation of the bandwidth for
the DS10. However, as a main result, we find
that the improvement in performance going from
the 533 MHz Alpha 21164 to the 466 MHz Alpha
21264 chip is around a factor of two, using
identical codes. Furthermore, the SSOR
preconditioner with irregular memory access is,
as expected, less efficient in terms of MFlops
than the plain BiCGStab.
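
The two ratios discussed here can be read off Table 1 directly. The sketch below merely recomputes them; the dictionary layout and function names are illustrative choices, and the bounds of 97 and 420 MFlops are the double-precision limits estimated above.

```python
# Table 1: (code, precision) -> single-node MFlops per board
TABLE1 = {
    ("BiCGStab", "double"): {"UX": 82, "DS10": 166},
    ("BiCGStab", "single"): {"UX": 116, "DS10": 232},
    ("SSOR", "double"): {"UX": 57, "DS10": 115},
    ("SSOR", "single"): {"UX": 90, "DS10": 182},
}
BOUND_DOUBLE = {"UX": 97, "DS10": 420}  # bandwidth limits in MFlops


def speedup(code, precision):
    """Performance gain going from the UX (21164) to the DS10 (21264) board."""
    row = TABLE1[(code, precision)]
    return row["DS10"] / row["UX"]


def fraction_of_bound(board):
    """Double-precision BiCGStab rate relative to the bandwidth limit."""
    return TABLE1[("BiCGStab", "double")][board] / BOUND_DOUBLE[board]


if __name__ == "__main__":
    for key in TABLE1:
        print(key, round(speedup(*key), 2))
    print("UX:", round(fraction_of_bound("UX"), 2))
    print("DS10:", round(fraction_of_bound("DS10"), 2))
```

All four speedups come out close to 2, while BiCGStab reaches roughly 85% of the bound on the UX board and about 40% on the DS10.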

3.2. Four-node results 

The impact of interprocessor communication
for both connectivities is determined on the
four-node testbed systems.



As shown in Table 2, the results for the fast
Ethernet mesh (denoted by UX) are disappointing.
The performance of both codes, SSOR and
BiCGStab, is reduced by roughly a factor of two
(between 1.7 and 2.6) compared to the single-node
result. The main degradation is due to the massive
protocol overhead, which forces the processor
into administration instead of computation.
User-level networking interfaces promise to
circumvent this problem in the near future, but
are currently not available for our configuration.

It is satisfying to see, by comparing Tables 1
and 2, that the Alpha 21264-Myrinet system
(denoted by DS10) with the Myrinet GM library has
a communication loss in the range of only 10 to
20 %. We expect a further considerable
improvement of these results by employing
software with a reduced protocol stack like SCore
or ParaStation.
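
The communication loss quoted here follows from comparing the single-node and four-node rates. The sketch below recomputes it under the assumption that both tables quote per-node MFlops, as the direct comparison in the text suggests; the data layout and function name are illustrative.

```python
# (code, precision) -> MFlops per board, copied from Tables 1 and 2
SINGLE_NODE = {
    ("BiCGStab", "double"): {"UX": 82, "DS10": 166},
    ("BiCGStab", "single"): {"UX": 116, "DS10": 232},
    ("SSOR", "double"): {"UX": 57, "DS10": 115},
    ("SSOR", "single"): {"UX": 90, "DS10": 182},
}
FOUR_NODE = {
    ("BiCGStab", "double"): {"UX": 32, "DS10": 130},
    ("BiCGStab", "single"): {"UX": 54, "DS10": 201},
    ("SSOR", "double"): {"UX": 30, "DS10": 100},
    ("SSOR", "single"): {"UX": 53, "DS10": 164},
}


def communication_loss(code, precision, board):
    """Fraction of single-node performance lost in the four-node run."""
    one = SINGLE_NODE[(code, precision)][board]
    four = FOUR_NODE[(code, precision)][board]
    return 1.0 - four / one


if __name__ == "__main__":
    for key in SINGLE_NODE:
        for board in ("UX", "DS10"):
            loss = communication_loss(key[0], key[1], board)
            print(key, board, f"{100 * loss:.0f} %")
```

For the DS10 columns the losses lie between about 10 and 22 %, whereas the fast Ethernet mesh loses between about 41 and 61 %.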

4. Conclusion 

Comparing price/performance ratios, we arrive
at the following estimates: an Alpha 21164 system
connected in a fast Ethernet mesh would, as an
optimistic estimate, lead to a 4 GFlops device
(sustained) for 128 processors at a price of
about 80 k$ per GFlops.

A 128-processor DS10 Alpha-Linux cluster
connected by Myrinet, however, promises to
reduce costs to 40 - 50 k$ per GFlops (estimated
from list prices as of July 1999) and is therefore
in the range of state-of-the-art dedicated QCD
machines.
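
The 4 GFlops figure can be related to the four-node benchmark by a rough extrapolation. The sketch below assumes, optimistically, that the per-node rate measured on four nodes carries over unchanged to 128 nodes; the function names are illustrative and the resulting total price is only what the quoted ratio implies, not a figure from the paper.

```python
def aggregate_gflops(per_node_mflops, nodes=128):
    """Sustained system rate, assuming perfect scaling of the per-node rate."""
    return per_node_mflops * nodes / 1000.0


def implied_price_kusd(gflops, kusd_per_gflops):
    """Total system price implied by a price/performance ratio."""
    return gflops * kusd_per_gflops


if __name__ == "__main__":
    # Fast Ethernet mesh: 30 to 32 MFlops per node in double precision.
    print(aggregate_gflops(30), aggregate_gflops(32))
    # Price implied by 4 GFlops at 80 k$ per GFlops.
    print(implied_price_kusd(4.0, 80.0))
    # Myrinet system: 100 to 130 MFlops per node in double precision.
    print(aggregate_gflops(100), aggregate_gflops(130))
```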
             double prec.       single prec.
Benchmark    UX      DS10       UX      DS10
BiCGStab     32      130        54      201
SSOR         30      100        53      164

Table 2

Four processor benchmark. Numbers in MFlops.






2 The STREAMS benchmark [g gives a real bandwidth of 